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Abstract. In this paper, we investigate Gaussian process regression mod¬ 
els where inputs are subject to measurement error. In spatial statistics, 
input measurement errors occur when the geographical locations of ob¬ 
served data are not known exactly. Such sources of error are not special 
cases of “nugget” or microscale variation, and require alternative meth¬ 
ods for both interpolation and parameter estimation. Gaussian process 
models do not straightforwardly extend to incorporate input measure¬ 
ment error, and simply ignoring noise in the input space can lead to 
poor performance for both prediction and parameter inference. We re¬ 
view and extend existing theory on prediction and estimation in the 
presence of location errors, and show that ignoring location errors may 
lead to Kriging that is not “self-efficient”. We also introduce a Markov 
Chain Monte Carlo (MCMC) approach using the Hybrid Monte Carlo 
algorithm that obtains optimal (minimum MSE) predictions, and dis¬ 
cuss situations that lead to multimodality of the target distribution 
and/or poor chain mixing. Through simulation study and analysis of 
global air temperature data, we show that appropriate methods for in¬ 
corporating location measurement error are essential to valid inference 
in this regime. 

Key words and phrases: Gaussian processes, measurement error, Krig¬ 
ing, MCMC, geostatistics. 


1. INTRODUCTION 

Gaussian process models assume an output variable of interest varies smoothly 
over an input space (e.g., percipitation totals across geographical coordinates, 
crop yield across factor levels of an experimental design). Such models appear 
frequently in areas as diverse as climate science [Mardia and Goodall (1993)], 
epidemiology [Lawson (1994)], and black-box problems such as computer experi¬ 
ments, and Bayesian optimization [Sacks et al. (1989), Srinivas et al. (2009)]. See 
Stein (1999), Cressie and Cassie (1993), Banerjee, Carlin and Gelfand (2014) and 
Gelman et al. (2014) for more detailed treatments. 

Noisy spatial input data are common in many applications; for example, geo- 
statistical data is often imprecisely spatially referenced, “binned” to the nearest 
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latitude/longitude grid point, or referenced to maps with distorted coordinates 
[Veregin (1999), Barber, Gelfand and Silander (2006)]. Accounting for measure¬ 
ment error on covariates in the context of regression models is a well studied 
theme [Carroll et al. (2006)]; however, despite their importance in applications, 
surprisingly little work has been done on interpolation or Gaussian process re¬ 
gression problems in the presence of (spatial) location measurement error. As 
we show in this paper, Gaussian process models do not straightforwardly extend 
to incorporate input measurement error, and simply ignoring noise in the input 
space can lead to poor performance. 

Previous research on such error sources has mostly focused on demonstrat¬ 
ing their existence and quantifying their magnitude [Bonner et al. (2003), Ward 
et al. (2005)]. For regression problems, Gabrosek and Cressie (2002) (and later 
Cressie and Kornak (2003)) adjust Kriging equations for the presence of location 
errors, and Fanshawe and Diggle (2011) further develop research for this regime 
to include problems where the locations of future observations or predictions are 
subject to error. Location errors have also been studied in the context of point 
process data [Zimmerman and Sun (2006), Zimmerman, Li and Fang (2010)]. 

Properly accounting for location errors is essential for optimal interpolation 
and uncertainty quantification, as well precise and efficient parameter estimation 
when parameters of the covariance function are unknown. Using theoretical re¬ 
sults and extensive simulations, our paper provides guidelines on situations when 
location errors are most impactful for data analysis, and suggestions for incorpo¬ 
rating this source of error into inference and prediction. We expand the research 
in Cressie and Kornak (2003) on best linear unbiased prediction (Kriging) to in¬ 
clude procedures for obtaining interval forecasts and for quantifying the cost of 
ignoring location errors. We also discuss Markov Chain Monte Carlo (MCMC) 
methods for optimal (minimum mean squared error (MSE)) predictions, which 
average over the conditional distribution of (latent) location errors given the ob¬ 
served data. 

Section 2 establishes notation and describes the basic model with location 
errors used throughout the paper. In Section 3, we discuss Kriging using the 
covariance structure of the location-error induced process. Section 4 considers 
MCMC methods for obtaining minimum MSE predictions, and thus improving 
upon Kriging. We compare these methods through simulation study in Section 
5, and explore an application to interpolating northern hemisphere temperature 
anomolies in Section 6. The proofs of all of the theoretical results are given in an 
Appendix. 


2. THE MODEL 

We will write s n = (si S 2 ■ • • s n Y to denote a n-vector of locations in the 
input space S C M p , and x„ = (x(si) x(« 2 ) • ■ • x(s n ))' as the associated vector 
of observations at s n . Similarly, we will denote x| ; = (x(s][) ... x(sl))', or simply 
x* = x(s*) where {s*, i = 1,..., k} are unobserved locations. The process x : § —> 
M is called a Gaussian process if, for any si,... s n 6 §, x„ = (x(si).x(s 2 ) • • • x(s n ))' 
is jointly Normally distributed. Typically, the form of this joint distribution is 
specified by a deterministic or parametric mean function (for now, taken without 
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loss of generality to be 0) and a covariance function c : 8 2 —> M, so that 


(1) 



( x(si)^ 

~ AT 

f 

0, 

/c(si,si) • 

c(si , Si 

\x{s n )J 


1 

\c(s n ,si) 



For c to be a valid covariance function, the covariance matrix in Equation (1) 
must be positive semi-definite for all input vectors s n = (si s 2 ... s n )'. 

Gaussian process regression is primarily used as a method for interpolating 
(predicting) values x* k at unobserved points s* k = (s* ... s* k Y in the input space, 
given all available observations. Such conditional distributions are easily obtained 
by exploiting the joint normality of the response x at observed and unobserved 
locations: 


x£|x n ~ Af(C(s* k , s n )C(s n , s n ) 1 x n , 

(2) C(st4) - C(s k , s n )C(s n , s n )- 1 C(s n , s * k )). 

In Equation (2), C(s n , s n ) denotes the covariance matrix of x n , C(s£, s n ) denotes 
the k x n covariance matrix between x. k and x n . 

When the locations in the input space § are affected by error, we observe a 
surrogate process y : § —> M, 


y(si) = x(si + Ui ), 

where Si is a known location in § and m € S is unobserved location error. The 
problem of Gaussian process regression with location errors addressed in this 
paper is to predict x at unobserved (exact) locations x(s*) given observations 
from the noise-corrupted process y. 

When x is assumed to be a Gaussian process, there is no nontrivial struc¬ 
ture for u that results in y being a Gaussian process. Additionally, it is not 
possible to write y as a convolution of x and a white noise process as differ¬ 
ences between the surfaces y and x will generally be correlated across space, i.e., 
Cov[y(si) — x(si),y(s 2 ) — x(s 2 )] Y 0- Gaussian process regression with location 
errors therefore cannot be thought of as a classical or Berkson errors-in-variables 
problem [Carroll et al. (2006)]. Interestingly, in some cases, the process y may 
be more informative for prediction at a new location x(s*) than the process x is. 
Thus, appropriate methods can deliver lower MSE interpolations in a location- 
error regime than the MSE of the usual methods in an error-free regime. 

3. KRIGING THE LOCATION ERROR INDUCED PROCESS Y 

As shown in Cressie and Kornak (2003), the second moment properties of y 
can be used to perform Kriging (they named this “Kriging adjusting for location 
error” or KALE), noting that measurement errors u induce a new covariance 
function 


k(s 1 , s 2 ) = Cov[y(si),y(s 2 )\ = E[c(si + ui, s 2 + u 2 )\ for s\ / s 2 
k(s,s) =Y[y(s)] = E[c(s + u, s + u)] 

(3) k*(s,s*) = Cov[y(s),x(s*)] = E[c(s + u, s*)]. 
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The expectation here is taken over the input errors u, which are assumed to 
have some joint distribution g Sn . The following result shows that if c is a valid 
covariance function, then so is k, regardless of the error distribution g(-). 

Proposition 3.1. Assume for all n and s n E §, (ui, U 2 , ■ ■ ■, u n ) ~ g Sn E Q, 
where Q is any family of probability measures on S. Then k is a valid covariance 
function if c is. 

Regardless of the form of c, k always exhibits the “nugget” effect, or discontinu¬ 
ities in the covariance function [Matheron (1962)] lim S2 _>. Sl k(s 2 ,s i) ^ k(si,s\). 
In fact, several authors cite location/positional error as a justification for in¬ 
cluding a nugget term in an arbitrary covariance function c [Cressie and Cassie 
(1993), Stein (1999)], alongside independent measurement error in observing the 
response, x(s) + e. Location errors, however, cause k to differ from c throughout 
the spatial domain § 2 (this is shown in Figure 1), meaning that while they induce 
a nugget, a nugget term alone cannot capture the effect of location errors. 


(3 = 5 


P = 1 


P = 0.1 





Fig 1. Comparison of c and k for § = R 2 and c(si,S 2 ) = exp(—/3||si — S 2 II 2 ), with m ~ 
Af(0, cr 2 12 ). Location errors <r 2 > 0 cause c and k to differ as a function of distance, and induce 
a nugget discontinuity at 0. 


Using k, we get the Kriging estimator adjusting for location error for x(s*) at 
an unobserved location of x: 

(4) x KALE (s*) = K*(s*, s n )K(s n , s n ) _1 y n . 

In Equation (4), K and K* respectively denote the covariance matrices corre¬ 
sponding to the kernels k and k*. The quantity x K ale(s*) is the best linear 
unbiased predictor for x(s*) (in terms of MSE) and has all the usual Kriging 
properties. When there are no location errors, the Kriging estimator is equivalent 
to the conditional expectation of x(s*) given x„ (see Equation (2)). 

In general, the covariance functions k and k* can be evaluated using Monte 
Carlo integration by repeatedly sampling u n from g. For several common combi¬ 
nations of covariance function and location error models, however, it is possible 
to arrive at the expressions in Equation (3) in closed form. In particular, if 

c(si,s 2 ) = T 2 exp(-/3d(si,s 2 )), 
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then we can define a random variable Z = d(s\ + u \, S 2 + U 2 ) and find its moment 
generating function Mz(t). If we can evaluate Mz(t) at t = —/ 3 , then this yields 
k(s\, S 2 ). For instance, for the squared exponential covariance function d(s 1 , S 2 ) = 
||si — S 2 11 2 and Normal location errors u ~ Af(0, <r 2 Ip)> Z has a scaled noncentral 
Xp distribution and 

t(si '“ 2) = a + “p fr ^ 11 * 1 - 82 » 2 ) for 

(5) k(s, s ) = r 2 

with a similar expression for k*(s, s*). Thus the covariance function for y is also 
squared exponential (it is not generally true that c and k will share the same 
functional form). Note, however, that not all parameters are identifiable—we 
must know at least one of (t 2 ,/3,ct 2 ) in order to estimate the others. 

Interestingly, it is possible for the KALE to yield lower MSE predictions than 
those given from an error-free regime, where u„ = 0 and x = y. In other words, y n 
can be more informative than x n for predicting x(s*). Heuristically, this happens 
when y n is more strongly correlated with x(s*) than is x ra . Below we characterize 
the conditions for observing this phenomenon in a simple model with one observed 
data point (Figure 1 provides an illustration); it seems difficult to generalize this 
to larger observed location samples and covariance/error structures. 

Proposition 3.2. Assume n = 1 , ||s — s *|| 2 = A 2 , c(s,s*) = r 2 exp(—/3A 2 ) 
for all s,s* G §, and u ~ AA( 0 , (j, 2 I p ). Without location error (cr 2 = Oj, the MSE 
in predicting x(s*) from x(s) is Co = t 2 (1 — exp(— 2[3A 2 )). There exists a u > 0 
such that E[(x kale (s*) — x(s*)) 2 ] < co if and only if /3A 2 > p/2. 

3.1 Interval predictions 

For many applications of Gaussian process regression, particularly in geostas- 
tics and environmental modeling, both point and interval predictions are of inter¬ 
est. However, Kriging, being strictly a moment-based procedure, does not provide 
uncertainty quantification for predictions other than variance. In a location-error 
Gaussian process regime, KALE predictions will always be non-Gaussian, thus 
variance alone is not sufficient to provide distributional or interval predictions. 

However, it is relatively straightforward to derive confidence intervals for pre¬ 
dictions at unobserved locations x(s*) given measurements y n at locations s n . 
The following proposition provides the exact distribution function (CDF) for 
prediction errors x(s*) — x K ale (s*), which can be inverted to obtain a confidence 
interval for x(s*) based on x K ale('S*)- 


Proposition 3.3. Let 

V(u n ) = cr 2 + yC(s n + u n , s n + u„.)y - 27'C(s n + u n , s*) 
where 7 = K(s n , s n )^ 1 K*(s n , s*), cr 2 = V[x(s*)]. 


Then 

( 6 ) 


(x(s*) - x KA le(s*) < z) = E 


$ 


AW, 

where T is the CDF of the standard Normal distribution. 
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It may be necessary to evaluate Equation (6) using Monte Carlo; if so, it is 
practical to use the same draws of u ra when evaluating different quantiles z, as this 
guarantees a Monte Carlo estimate of the distribution function be non-decreasing. 

While intervals based on (6) provide exact coverage (modulo Monte Carlo 
error), such coverage is achieved by averaging over all data, both observed (y„) 
and unobserved (x(s*)) as well as the location errors, u n . This is in contrast to 
usual interval estimates from Gaussian process regression without location error, 
which are exact probability statements conditional on the observed data x n . The 
reason this is an important distinction is because the usual Gaussian process 
conditional probability intervals yield the proper coverage rate across multiple 
prediction intervals (when predicting x at a collection of unobserved locations 
s^), whereas the confidence intervals corresponding to KALE may not. 

3.2 Advantages over Kriging while Ignoring Location Errors 

Failing to adjust for location errors when Kriging (Cressie and Kornak (2003) 
called this “Kriging ignoring location errors” or KILE) can lead to poor perfor¬ 
mance. A data analyst ignoring the location errors will use (see Equation (2)) 

( 7 ) £kile(s*) = C(s*, s n )C(s n , s n ) - 1 y n . 

Since x K ale(s*) is the best linear unbiased estimator for x(s*) and x K ile is also an 
unbiased linear estimator, KALE dominates KILE and always yields a reduced 
MSE. Figure 2 illustrates the disparity in MSE for a simple model; intuitively, the 
relative cost of ignoring location errors increases as the magnitude of the location 
errors increases. We also see in panel (B), illustrating Proposition 3.2, that for 
small values of of ,. the MSE for both KALE and KILE decreases in 




(a) Locations at which we observe y(s), as (b) For both KALE and KILE, MSE ac- 
well as the location at which we wish to tually decreases as the magnitude of loca- 
predict x(s). Sample paths of Gaussian pro- tion errors increases when this magnitude 
cesses with this covariance function are also is small. Above a certain point, greater lo- 
shown. cation error yields higher MSE and greater 

disparity between KALE and KILE. 


Fig 2. Here we assume x(s) is a Gaussian process with mean 0 and covariance function 
c(s i,S 2 ) = exp(—(si — S2) 2 ), with Ui Af(0, cr 2 ). We compare MSE for predicting x(5) based 
on 1/(0),..., y(4) using KALE and KILE. 
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Besides yielding suboptimal predictions relative to KALE, ignoring location 
errors also leads to an estimator for x(s*) that is not self-efficient [p. 549, Meng 
(1994)]. Following Meng (1994), an estimator T for parameter 8 is self-efficient if 
for any A E [0, 1] and subset of the observed data X c C X , we have 

E[(AT(A') + (1 - A )T(X C ) - 8) 2 } > E[(T(AT) - 9) 2 }. 

Thus, roughly speaking, self-efficient estimators are those that cannot be im¬ 
proved by using only a subset of the original data [Meng and Xie (2014)]. 

The following theorem states that the KILE MSE is unbounded as a function 
of any single spatial location Sj for i = 1 ,n. This is a stronger result than 
just the lack of self-efficiency. A consequence of Theorem 3.4 is that, assuming 
only simple continuity conditions on the covariance function and location error 
model, the KILE MSE can always increase when observing more data, regardless 
of the locations of the existing observations or the locations at which we want to 
make predictions. 


Theorem 3.4. Suppose that the following conditions hold: 

• c is continuous and bounded in § 2 ; 

• the location error model g satisfies (vlfi.uff) (ui,U 2 ) for all s\,S 2 E § 
and sequences {sfi.s’ff) such that lim m _ >00 (s^ 1 , s™) = (si,S 2 )> 

• and P(iti 7 ^ U 2 ) < 1 for all si,S 2 E S. 

Let x K ile(s*) be the KILE estimator for x(s*) given y n . Then for any M > 0, 
n > 2, and S 2 , ■ ■ ■, s n E S, there exists si such that E[(.x(s*) — £ K ile(s*)) 2 ] > M. 


Note that the condition that c is continuous excludes a nugget term from the 
distribution of x. We prove Theorem 3.4 (in the Appendix) by showing that when 
observed locations are very close together, the corresponding covariance matrix 
is nearly singular, and this increases MSE. Without location errors, the usual 
Kriging estimator does not exhibit this behavior since the difference between 
values of x(s) for points that are close together also converges in probability to 
0. This is not the case for the noise-corrupted process, as y(s 2 ) — y(s\) does not 
converge to 0 . 



a^ = 0.01 



a> 

LU o 
</> 




Fig 3. Here we assume x(s) is a Gaussian process with mean 0 and covariance function c(s, s*) = 
exp(—(s — s*) 2 ) + a 2 l s =s*. Location errors have the form m ~ A/”(0,0.04). We use KILE to 
predict *(5) based on y 0 h s = {j/(0),... , 2/(4), ?/(6),..., ?/(10)}, as well as an additional observation 
V(s). The MSE in predicting x(5) given y 0 b s and y(s ) is plotted as a function of s, while the 
red line denotes the MSE based only on y 0 bs- Despite the magnitude of the location errors being 
relatively small, observing another measurement of y at some locations can increase (possibly 
dramatically) the MSE. 
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Simulation results suggest that even when c contains a nugget term <r 2 , KILE 
is still not self-efficient, and additional observations can increase MSE. Figure 
3 illustrates the change in MSE as a function of the location of an additional 
observation of y. Following Theorem 3.4, we see the MSE is unbounded when 
cr 2 = 0. But even when ct 2 = 1, it is possible for an additional observation to 
(slightly) increase MSE. 

3.3 Parameter Estimation for Kriging 

In typical applied settings, some or all parameters of the covariance function 
are unknown and must be estimated by the analyst in order to obtain Krig¬ 
ing equations. For Gaussian process models without a location error component, 
parameter estimation can be accomplished using likelihood methods. This can 
be computationally challenging for large data sets, as each likelihood evalua¬ 
tion requires a Cholesky factorization of the covariance matrix (or equivalent 
operations), which is 0(n 3 ) except in special cases. An alternative is to choose 
parameters by maximizing goodness of fit between the empirical variogram and 
the theoretical (parametric) variogram, though this is less efficient for parametric 
Gaussian models. 

Location errors present challenges for both such procedures as the covariance 
function for the observed provess y (3) may not be available in closed form, 
meaning neither the likelihood function or variogram can be evaluated exactly. 
While Monte Carlo methods surely offer effective approaches in theory [Fanshawe 
and Diggle (2011)], they muliply the computational expense of the problem, as 
each evaluation of the likelihood requires M matrix factorizations, where M is 
the number of Monte Carlo samples used to approximate the likelihood. Cressie 
and Kornak (2003) advocate a pseudo-likelihood procedure [Carroll et al. (2006)] 
that uses a Gaussian likelihood approximation based on the first two moments of 
V, 


( 8 ) L(0;y n ) oc |K e (s n ,s n )| 1/2 exp ^-^y^K 0 (s n , s n ) V„^ , 

where we write to explicitly mark the dependence of the covariance function k 
on unknown parameters 6. This pseudo-likelihood requires inverting K only once 
per pseudo-likelihood evaluation, even when K g is computed by Monte Carlo. 

We can work out inferential properties of the maximum pseudo-likelihood esti¬ 
mator 9 = argmax e L(0; y n ). First, it is straightforward to check that the pseudo¬ 
score pertains to an unbiased estimating equation: 

(9) E[5(0; yn)] = E[V log(L(0; y„))] = 0. 


Moreover, one can show the covariance matrix of the pseudo-score is given by 


G(6) =E[S(0;y n )S(0;y n )'] 


G(% =E 


-Tr{OjC 6 i(un)O j C 0 (u n )} 


(10) + \ (E[Tr{aC e (un)}Tr{%C 0 (u„)}] - TV{^K4TY{^K4) , 
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using the notational abbreviations Cg(u n ) = Ce(s n +u n , s n +u n ), Kg = Ke(s n , s n ) 
E[Cg(u n )], and Q,; = K^ 1 g'j K^ 1 . Lastly, the expected negative Hessian 

of the log pseudo-likelihood is 


( 11 ) 


H(0)ij = E 


c> 2 

dQidOj 


log(L(0;y n )) 


1 

2 


Ti^KeQjKe}. 


If there are no location errors (u n = 0), L is an exact likelihood and the second 
term in the right hand side of Equation (10) vanishes so that G(8) = H(8), 
confirming the second Bartlett identity [Ferguson (1996)]. For non-zero location 
errors, however, we construct the Godambe information matrix as an analog to 
the Fisher information matrix [Varin, Reid and Firth (2011)], 

m = H{e)[G(e)]- l H{6)- 

Evaluating 1(8) for different location error models illustrates the information loss 
in estimating covariance function parameters 6 relative to the error-free case, 
where 1(9) = G(9) = H(9) is equivalent to the Fisher information matrix. 

General theory of unbiased estimation equations [Heyde (1997)] suggests the 
asymptotic behavior of the pseudo-likelihood procedure satisfies 

(12) I(e)V 2 ($-0) 4 A7(0,I). 


However, Expression (12) does not hold in general even in an error-free regime 
u n = 0, as asymptotic results for Gaussian process covariance parameters depend 
on the spatial sampling scheme used and the specific form of the covariance 
function [Stein (1999)]. We nevertheless expect (12) to hold for suitably well- 
behaved processes under increasing-domain asymptotics. Guyon (1982) gives an 
applicable result when locations s n are on a lattice. We are not aware of other 
theoretical results in this context. 


4. MARKOV CHAIN MONTE CARLO METHODS 

Markov Chain Monte Carlo methods offer an alternative to Kriging for predic¬ 
tion in a regime with noisy inputs. They allow us to compute the MSE-optimal 
prediction 

x(s*) = E[x(s*)|y n ] 

(13) — (C(s , S n + lln) [C(s n + U n , S n + Un)] yVi) |y n ) G?U n , 

which will dominate the KALE estimator (4) in terms of MSE for any model 
and set of observed and predicted locations. The optimality of x(s*) in (13) 
is due to the fact that the conditional mean E[x(s*)|y n ] obtains the minimum 
MSE for any estimator of x(s*) that is a function of y n . This estimator is not 
linear, and MCMC methods are necessary for evaluating (13) as the density 
for the conditional distribution 7r(u n |y n ) will not be available in closed form 
(no possible “conjugate" form for the distribution of u n is known to us). When 
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model parameters, such as in the covariance function c or the distribution of 
u are unknown, the distribution 7r(u n |y n ) implicitly averages over the posterior 
distributions of such parameters. 

MCMC methods also allow us to compute prediction intervals (zi ow , z w h) such 
that P(z low < x(s*) < Zhighlyn) = 1 — ol. When the covariance function c and 
location error model g are known, these intervals are exact probability condi¬ 
tional probability statements, providing a stronger coverage guarantee than that 
achieved with the KALE procedure in Proposition 3.3, where coverage is achieved 
only by averaging over y n . 

4.1 Distributional Assumptions 

MCMC inference for (13) requires the assumption that x n is Gaussian. While 
this is a common assumption in practice and has been assumed throughout this 
paper, it is not necessary to derive the KALE equations and their MSE (but it 
is necessary to produce coverage intervals as in Proposition 3.3). Thus, Kriging 
approaches, including KALE, are attractive when there is information about the 
joint distribution of x beyond its first two moments. 

In this scenario, however, we can still advocate—from a decision-theoretic 
perspective—a Gaussian assumption when the goal of the analysis is minimum 
MSE prediction. Specifically, let 7r G L1o,c be a choice of joint distribution for x n 
with the appropriate first two moments 0 and C. The minimum MSE prediction 
of x(s*) assuming x n ~ n is the conditional mean E 7r [x(s*)|x n ], Let Rtt 0 (tt) be the 
risk (MSE) of this minimum MSE predictor under the assumption that x ri ~ ir 
when in fact x n ~ ttq] that is, 

RttoM = [(E 7 r [a?(s*)|x„] - x(s*)) 2 }. 

Thus R nQ (w) represents the risk (MSE) in a misspecihed joint distribution for 
x ra , where the analyst assumes x n ~ ir, but in fact x n ~ ttq. We then have the 
following proposition, based on Theorem 5.5 of Morris (1983): 

Proposition 4.1. Let -kq G Llo.c be Gaussian. Then for all n G Llo,c we 
have 

RttM < Rvr^o) = RttoCtTo) < RttoM- 


Unlike traditional decision theory problems, here we are fixing the estimator 
(Kriging), and considering the costs of different distributional assumptions (7r). 
Given that the analyst has decided to use Kriging for predicting x(s*), then the 
risk in making an incorrect distributional assumption is R 7ro (7To) — R 7r ('/ro) = 0. 
This reflects the fact that the Kriging MSE depends only on the first two mo¬ 
ments of 7r. However, there is an “opportunity cost” in making any non-Gaussian 
assumption R 7r (7To) — R^tt) > 0 for 7r / 7To, which represents the reduction in 
MSE under 7r that could be achieved by using a different estimator other than 
Kriging. 

Obviously, if there is a strong reason to believe a non-Gaussian n is true, then 
analysis should proceed with this assumption, ideally leveraging an estimator that 
is optimal under these assumptions (instead of Kriging). However, without strong 
distributional knowledge, the analyst can assume Gaussianity without risking 
increased MSE or paying an opportunity cost for using an inefficient method. 
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4.2 Hybrid Monte Carlo 

Hybrid Monte Carlo [Neal (2005)] is well-suited for the problem of sampling 
7r(u n |y n ) oc 7r(y n |u n )7r(u n ) in order to evaluate (13). This is because while 
7r(u n |y n ) is computationally expensive (requiring inversion of the covariance ma¬ 
trix C 0 (u n ) = C 0 (s„ + u n , s n + u n )), the gradient V log(7r(u n |y n )) is a relatively 
cheap byproduct of this calculation. Often the conditional distribution u n |y n is 
correlated across components, making gradient-based MCMC methods more ef¬ 
ficient for generating samples. Other gradient-based MCMC sampling methods, 
such as the Metropolis-adjusted Langevin algorithm [Roberts et al. (2001)] and 
variants, may also be well-suited to this problem. 

Bayes rule provides tt(9, u n |y n ) oc 7r(y n |0, u n )i t(6, u n ), where 9 here represents 
any unknown parameter(s) of the covariance function c. In most situations it 
will be reasonable to assume u n and 9 are independent a priori—this is trivially 
true in the case that 9 is assumed known. Assuming this, and recognizing that 
^(ynl^jUn) is Gaussian, we can write the log posterior and its gradient: 


log(vr(6», u n |y„)) = -^ log(|C 0 (u„)|) - ^y' n C e (u n ) 1 y n + const. 


d 

dui 

1 

2 


log(vr(6»,u„|y n )) = 


^TV [ C^u,)- 1 i-Ce{n n ) 


d_ 

dui 


d 

— log(7r(6»,u n |y n )) = 


1. 


TV Cfl(u„) _1 


d 

dOi 


C e (u n ) 


\ Q 

{C e {\i n y l y n y' n log(vr(u n )) 

(C e (u„)- 1 ynyn - In)^ ^ log(vr((9)). 


The computational cost of both the likelihood and gradient are dominated 
by solving C 0 (u n ) ( e.g ., Cholesky factorization), which is 0(n 3 ). Every likeli¬ 
hood evaluation computes this term, which can then be re-used in the gradient 
equations. Thus, the computational cost of computing both the likelihood and 
gradient remains 0(n 3 ). 

4.3 Multimodality 

The posterior distribution tt(9, u n |y n ) is often multimodal, more so if the dis- 
trubution 7 r(u n ) is diffuse. This is because if there is a local mode at (9, u n ), there 
may be a local mode at any (9 , u n ) such that C@(u n ) = Cg( u n ), as the likelihood 
is constant for such (0,u n ). In particular, for isotropic covariance models, the 
likelihood is constant for additive shifts in u„, or rotations of s n + u ri , as these op¬ 
erations preserve pairwise distances. Additionally, multimodality can be induced 
by the many-to-one mapping of the set of true locations {sj + Ui, i = 1,..., n} to 
the set of observed locations {si,i = 1,... ,n}. For instance, with n = 2 and an 
isotropic covariance function, for any choice of u\,U 2 we get the same likelihood 
with u\ = S 2 T U 2 — si and U 2 = s\ + u\ — S 2 ■ Moreover, for fixed u n , for many 
common covariance functions it is possible for the posterior of 9 to be multimodal 
[Warnes and Ripley (1987)]. 

HMC (and other gradient MCMC methods) can efficiently sample from mul¬ 
tiple modes, however this becomes difficult when the modes are isolated by re¬ 
gions of extremely low likelihood [Neal (2011)]. Isolated modes can occur in the 
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location-error GP regime. For example, assume one-dimensional locations (p = 1) 
and an isotroptic covariance model with known parameters 6 and nugget 

Marginally, as ||si + ui — (s 2 + '« 2 )|| -> 0, y\ — j /2 —t A7(0, 2 0 ^); that is, the scaled 
difference \yi — 2 / 2 ]/('Ll) must be reasonably small. When this is not the case 
(e.g. = 0), then the log-likelihood asymptotes at si + u\ = S 2 + U 2 almost 

surely. Thus, the Markov chain can only sample u n such that the ordering of 
{si + Ui,i = 1,..., n} is preserved. Note that when p > 1, while the log-likelihood 
may still asymptote at si + u\ = S2 + U2, this no longer constrains the space of 
u n (except on sets with measure 0). 



-4 -2 0 2 4 -4 -2 0 2 4 -0.5 0.5 

Ui u, ih 

(A) al = 2, a1 = 0.0001 (b) a 2 u = 2, a 2 = 1 (c) a 2 u = 0.1, a 2 x = 0.0001 

Fig 4 . Density of (ui, W2) using the covariance function c(s 1, S2) = exp(—(si — S2) 2 ) +<t 2 1 s i=s 2 . 
We simulate data (j/i, t/ 2 ) using si = 0, si = 1, and m ~ Af(0,a 2 ), and different values of o 2 
and o' 2 . 


Figure 4 demonstrates the modal behavior for this simple example with p = 1 
and n = 2. When location errors are large in magnitude and the nugget tern 
(j^ is small, the modes of ( 111 , 112 ) are separated by a contour of near 0 density 
(panel A). A higher nugget a ^ increases the density between the modes, making 
it easier for the same MCMC chain to travel between them (panel B). Decreasing 
the magnitude of the (Gaussian) location errors, cr^, puts more mass on a single 
mode, as the unimodal distribution 7r(u n ) has a greater influence on 7r(u n |y n ) 
(panel C). 

Thus, as with any MCMC application, for the location-error GP problem it is 
advisible to run separate chains in parallel, with different, diffuse starting points, 
and monitor mixing diagnostics [Gelman and Shirley (2011)]. Multiple chains fail¬ 
ing to mix is likely a symptom of multiple isolated modes, in which case we should 
modify the HMC algorithm to include tempering [Salazar and Toral (1997)] or 
non-local proposals that allow for mode switching [Qin and Liu (2001), Lan, 
Streets and Shahbaba (2013)]. Another strategy to overcome multiple isolated 
modes is importance sampling: as Figure 4 shows, increasing the nugget vari¬ 
ance increases the density between modes. If we generate samples according 
to 7f(#,u n |y n ) oc 7f(y n |0, u n )7r(0)7r(u n ) where 7f(y n |0, u n ) is the density corre¬ 
sponding to AA(O,C 0 (u ra ) + rel n ) for some fixed k, then it is straightforward to 
compute importance weights 7 r( 0 , u n |y n )/7f(0, u n |y n ). This is because Ce(u n ) _1 
is easy to compute from (Cg(u n ) + «:I n ) _1 (and vice versa) using the Woodbury 
formula. Either standard importance sampling, or Hamiltonian importance sam¬ 
pling [Neal (2005)], could be used to generate parameter estimates, point/interval 
predictions, and any other posterior estimates of interest. 
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Parameter 

Values simulated 

Prior support 

T 2 

1 

(0, 10 ) 

P 

0.001, 0.01, 0.1, 0.5, 1, 2 

(0.0005,3) 


0.0001, 0.01, 0.1, 0.5, 1 

(0,10) 

<?u 

0.0001, 0.01, 0.1, 0.5, 1 

(0,10) 


Table 1 

Parameter values used in simulation study. The range (0.0005,3) for /3 guarantees that at least 
one pair of points among our observed data has a correlation in the range (0.05,0.95). This 
eliminates modes corresponding to white noise processes from the likelihood surface. 

5. SIMULATION STUDY 

We compare Kriging (both KALE and KILE) and HMC methods for point/interval 
forecasts for Gaussian process regression in a simulation study. For various com¬ 
binations of parameter values for the covariance function c(si,S 2 ) and location 
error model g(u) we simulate observations y n where y* = x(si + Ui) and make 
predictions for values of x at unobserved locations: x* k = (x(s\) ... x(s k ))'. 

We simulate data using the squared exponential covariance function c(s i, S 2 ) = 

t 2 exp(— /3||si — S 2 1| 2 ) + cr 2 lsi=s 2 an d an i.i.d. Gaussian location error model u t 
N(0, cr 2 I p ). The squared exponential covariance function and Gaussian location 
error model combine to form a convenient regime, as we can evaluate k in closed 
form (5). Without loss of generality, we can use t 2 = 1 for all simulations as 
it is simply a scale parameter. We consider a p = 2 dimensional location space, 

Si 6 M 2 . On a 8 x 8 grid, we randomly select 54 locations at which we observe y, 
and target the remaining 10 locations for interpolating x. Figure 5 illustrates a 
range of data samples for processes used in our simulations on this space, while 
Table 1 provides a full summary of all the parameter value combinations we 
consider. Data from each parameter combination is simulated 100 times. 



2468 2468 2468 

Fig 5. Samples of x(s) for different values of the length-scale parameter fd with the squared 
exponential covariance function, c(si,S 2 ) = exp(—/3||si— S 2 || 2 )+cUl sl=S2 • Black points are where 
we have observed y(s) and white points are where we wish to predict x(s). Observed/predicted 
locations were randomly sampled from an 8x8 grid. 

We evaluate the three prediction methods—KALE, KILE, and HMC—using 
both adjusted root mean squared error (R.MSE) and the coverage probability of 
a 95% interval. “Adjusted” RMSE is based on the MSE with a 2 subtracted out, 
as this term appears in the MSE for any prediction method. For every parameter 
combination of interest used, these statistics are calculated first by averaging over 
each of the k = 10 prediction targets in each simulated draw of new data, and 
then over the J = 100 independent data draws. 
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Both evaluation statistics can be evaluated more precisely during simulation 
by utilizing a simple Rao-Blackwellization. For iteration j. instead of drawing x £ 
in addition to y n and calculating rmse^ = || x £ — xf \\/k, we simply condition on 
the simulated location errors u n to get rrnsej = E[||x^,— x* k \\/k \ y n , u n ]. Similarly, 
to calculate coverage of an interval (L s *(y n ),U s *(y n )) for x(s*), for iteration j 
we use 

1 . k' 

cov ^' = ^ G ( L a *( y n ), U s *( y n ))] | y n , u n ]. 

2—1 

HMC is implemented using the software RStan [Stan Development Team (2014)], 
which implements the “no-U-turn” HMC sampler [Homan and Gelrnan (2014)]. 
10000 samples were drawn during each simulation iteration, which (for most pa¬ 
rameter values) takes a few minutes on a single 2.50Ghz processor. 

5.1 Known covariance parameters 

We first simulate point and interval prediction for KALE, KILE, and HMC 
using the same parameter values that generated the data. By doing so, we leave 
aside the issue of parameter inference and simply compare the extent to which 
different methods leverage the information in the location-error corrupted data 
y n to infer x(s*). Figure 6 compares RMSE for the three methods when there is 
a very small nugget, cr^ = 0.0001. 


Relative RMSE for KALE/KILE Relative RMSE for HMC/KALE 
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(a) RMSE ratio of KALE to KILE. (b) RMSE ratio of HMC to KALE. 

Fig 6. Relative RMSE of KALE and KILE (A) and HMC and KALE (B) for each combination 
of parameters (/3, a„) indicated , and <7% = 0.0001. Blue shading represents a relative decrease in 
RMSE while red shading represents a relative increase in RMSE. 

We can see that there is little difference among the three methods when cr^ is 
sufficiently small (0.0001), or when f3 is sufficiently large (2). This makes sense, 
as in the former case, with small location errors the potential improvement over 
KILE (which is exact for a\ = 0) is negligible, and in the latter case, observations 
are too weakly correlated for nearby points to be informative. Larger values of 
give KALE a significant reduction in RMSE versus KILE, with the reduction 
as large as 79% for the case of large magnitude location errors ( a ^ = 1) and a 
moderately smooth signal (/3 = 0.1). 

The idea of a moderately smooth signal requires further elaboration: for a 
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given cr^, when x is very smooth ((3 very small), the process is roughly constant 
within small neighborhoods, meaning y(s) ~ x(s ) and location errors are less of 
a concern for accurate inference and prediction. On the other hand, when (3 is 
very large and the process is highly variable in small regions of the input space, 
location errors are less of a concern because there is very little information in 
the data to begin with. Location errors are most influential when the process x 
has more moderate variation across neighborhoods corresponding to the plausible 
range of the location errors. 

HMC offers further reductions in RMSE over KALE in roughly the same re¬ 
gions of the parameter space in which KALE improves over KILE, although 
the additional improvement is less dramatic. The maximum RMSE reduction we 
observe is about 28%, once again for a moderately smooth signal with larger 
magnitude location errors. 


Relative RMSE for KALE/KILE Relative RMSE for HMC/KALE 
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(A) RMSE ratio of KALE to KILE. (b) RMSE ratio of HMC to KALE. 

Fig 7. Relative RMSE of KALE to KILE (A) and HMC and KALE (B) for each combination 
of parameters (/3, cr^) indicated, and a ^ = 0.1. Blue shading represents a relative decrease in 
RMSE while red shading represents a relative increase in RMSE. 

When the nugget variance a% is increased (Figure 7 shows results for a\ = 0.1), 
differences in RMSE among the three methods become smaller (the differences 
are wiped out entirely at a 2 r = 1, which is not pictured). This is not due to a 
shared a^. term in the RMSE value for all methods, as this is subtracted out. 
Rather, the similarity of all three methods reflects the fact that a larger nugget 
leaves less information in the data that can be effectively used for prediction. 
However, the differences that we do observe (both comparing KALE to KILE 
and HMC to KALE) occur primarily when the magnitude of location errors a\ 
is large. 

In the case where all parameters are fixed and known, both KALE and HMC 
produce intervals with exact coverage (subject to Monte Carlo or numerical ap¬ 
proximation errors) in all simulations. KILE, however, can severely undercover 
in the presence of location errors. Figure 8 shows coverage as low as 4% when the 
magnitude of the location errors is high (<r)j = 1), {3 = 0.1, and the nugget vari¬ 
ation is minimal ( a ^ = 0.0001). Undercoverage still persists in this region of the 
parameter space for = 1, the largest nugget variance used in our simulations. 
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95% interval coverage for KILE 95% interval coverage for KILE 
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(A) al = 0.0001. (b) al = 1 

Fig 8. 95% interval coverage for KILE for = 0.0001 (A) and = 1 (B). With moderately 
smooth signals and large location errors, we see severe undercoverage that does not disappear 
even for = 1. 


5.2 Unknown covariance parameters 

In typical applied settings, the analyst will not know model parameters such 
as those of the covariance function (t 2 ,/3), the nugget variance cr 2 , or even the 
variance of the location errors a 2 . Due to identifiability issues with our choice of 
covariance function in this simulation (5), we assume cr 2 is known but estimate 
all other parameters before making predictions at unobserved locations. 

For KILE and KALE, parameter estimation is accomplished through maxi¬ 
mum (pseudo-) likelihood, as in (8). Parameter estimates are then plugged into 
Kriging equations (4)-(7) to obtain corresponding point and interval estimates. 
Because c and k are both squared exponential (5), the pseudolikelihood estimation 
procedure estimates the same covariance function for y, however the estimated 
parameters (and therefore Kriging equations, based on k*) will differ. The plug-in 
approach ignores uncertainty in parameter estimates, so plug-in MSE estimates 
will be too optimistic. Various techniques exist for adjusting MSE from estimated 
parameters [Smith (2004), Zhu and Stein (2006)], though there is no need to in¬ 
corporate such techniques into our analysis since exact (up to Monte Carlo error) 
MSEs are provided by simulation. 

For HMC, we supply unknown parameters with prior distributions and sample 
parameters and predictions jointly from the posterior distribution n(6, x£|y n ). 
The priors we use are flat over a reasonable range (see Table 1), which guarantees 
both a proper posterior and a posterior mode that agrees with the maximum 
likelihood estimate of 6. This second condition supports fair comparisons between 
predictions derived from HMC parameter estimates versus those based on the 
maximum (psueolikelihood) parameter values. 

Figure 9 provides the relative RMSE of KALE vs. KILE, and HMC vs. KALE, 
for predictions when parameters must first be estimated (using cr 2 = 0.0001). 
We notice that there does not appear to be a great advantage in KALE over 
KILE when parameters are first estimated. This is because, as mentioned earlier, 
the marginal process y still has a squared exponential covariance function 5, so 
Kriging equations for KALE and KILE will be very similar. On the other hand, 
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Relative RMSE for KALE/KILE Relative RMSE for HMC/KALE 
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(a) RMSE ratio of KALE to KILE. (b) RMSE ratio of HMC to KALE. 

Fig 9. Relative RMSE of KALE and KILE (A) and HMC and KALE (B) for each combination 
of parameters ((3, a\) indicated, and o\ = 0.0001. Parameters are assumed unknown and first 
estimated to obtain point predictions. 

we notice a modest improvement when using HMC over Kriging, except in a small 
region of the parameter space (<r„ < .01 and /3 6 [0.5,1]). 
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(a) RMSE ratio of KALE to KILE. (b) RMSE ratio of HMC to KALE. 

Fig 10. Relative RMSE of KALE and KILE (A) and HMC and KALE (B) for each combina¬ 
tion of parameters (73, a\) indicated, and = 0.1. Parameters are assumed unknown and first 
estimated to obtain point predictions. 


When the nugget variance is increased to = 0.1, we see the results in Figure 
10. We still see relatively similar performances from KALE and KILE. HMC 
offers a small improvement over KALE when /3 > 0.01, though for f3 = 0.001 we 
actually see significantly higher MSEs with HMC. At f3 = 0.001 the process is 
extremely smooth, as the most distant pairs of observations still have a correlation 
of 0.88. We are thus more concerned with overestimating (3 than underestimating 
it; as the former shrinks predictions towards 0 while the latter shrinks towards 
(approximately) the mean of all observations. As we use a flat prior for (3, where 
almost all mass is located (3 > .001, the posterior tends to overestimate /3, leading 
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to draws with relatively high MSE. 


95% interval coverage for KILE 95% interval coverage for KILE 
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(A) al = 0.0001 (b) cr^ — 1 

Fig 11. 95% interval coverage for KILE for = 0.0001 (A) and = 1 (B). 
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(A) al = 0.0001 (b) al = 1 

Fig 12. 95% interval coverage for KALE for a\ = 0.0001 (A) and = 1 (B). 


Neither Kriging or HMC guarantees prediction intervals with the correct cover¬ 
age in the regime where parameters must first be estimated (though HMC would 
give proper “Bayes coverage” when simulating 0 according to the prior used). We 
nevertheless present coverage results in Figures 11-13. While we do not expect 
any method used to provide exact coverage, Kriging (both KALE and KILE) suf¬ 
fer from significant undercoverage for some regions of the parameter space, while 
HMC is consistent in offering at least 85% coverage throughout our simulations. 
In a regime without location errors, Zimmerman and Cressie (1992) advocate 
Bayesian procedures under non-informative priors over frequentist procedures in 
order to obtain interval estimates with good coverage; our simulation results, 
albeit in the context of location errors, agree with this finding. 
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95% interval coverage for HMC 
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Fig 13. 95% interval coverage for HMC for cl = 0.0001 (A) and cl = 1 (B). 


5.3 Summary 

Our simulation results confirm the theoretical guarantee of KALE dominating 
KILE in prediction MSE when the covariance function is known, and further¬ 
more HMC dominating KALE. The magnitude of differences in MSE between 
these methods is greatest when the process is moderately smooth relative to the 
spatial sampling ( e.g ., 0.01 < j3 < 0.5), when the magnitude of location errors 
is largest, and when nugget variation a\ is smallest. For such regions of the 
parameter space, KILE fails to deliver prediction intervals with proper coverage, 
whereas KALE and HMC can give valid prediction intervals for any parameter 
values. 

An important consequence in adjusting for location errors with a known co- 
variance function is the corresponding adjustment to the nugget. The discussion 
in (Sections 3.6 and 3.7 of) Stein (1999) emphasizes the importance of correctly 
specifying the high-frequency behavior of the process when interpolating (cor¬ 
rectly specifying the low-frequency behavior is less crucial), including the nugget 
term. Estimating parameters, including the nugget term o\. implicitly corrects 
for model misspecification when ignoring location errors. Thus we see little dif¬ 
ference in predictive performance between KALE and KILE when parameters 
are first estimated. Depending on the choice of prior, KALE/KILE may give 
lower MSE predictions than HMC, which averages over posterior parameter un¬ 
certainty; however, interval coverage is better for HMC (using weak prior infor¬ 
mation) than for KALE/KILE. 

6. INTERPOLATING NORTHERN HEMISPHERE TEMPERATURE 

ANOMOLIES 

To illustrate the methods discussed in this paper, we consider interpolating 
northern hemisphere temperature anomolies during the summer of 2011 using the 
publicly available CRUTEM3v data set 1 [Brohan et al. (2006)]. Figure 14 shows 
our data. These data are used in geostatistical reconstructions of the Earth’s 


1 http://www.cru.uea.ac.uk/cru/data/temperature/ 
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temperature field, which interpolate temperatures at unobserved points in space- 
time in order to better understand the historical behavior of climate change (see, 
e.g., Tingley and Huybers (2010) and Richard et al. (2012)). Each observation 
is a spatiotemporal average: temperature readings are averaged over the April- 
September period and each 5° x 5° longitude-latitude grid cell. These values are 
then expressed as anomolies relative to the global average during the period 1850- 
2009, which is calculated using an ANOVA model [Tingley (2012)]. Apart from 
this spatiotemporal averaging, numerous other preprocessing steps adjust this 
data for differences in altitude, timing, equipment, and measurement practices 
between sites, along with other potential sources of error; please see Morice et al. 
(2012) and Jones et al. (2012) for more details. 

Our analysis, restricted to interpolating a single year of data, and without using 
external data such as temperature proxies [Mann et al. (2008)], is intended as a 
proof of concept rather than as a refinement or improvement to existing analyses 
of these data. We wish to illustrate the potential impact of location errors on 
conclusions drawn from these data. 



Fig 14. CRUTEM3v data for summer 2011, with 2011 mean subtracted so that measurements 
represent spatial anomolies. Generally speaking, we see lower (cooler) anomolies in North Amer¬ 
ica and positive (warmer) anomolies in Europe. Higher latitudes also tend to have positive 
anomolies. 


The “gridding”, or spatial averaging across 5° x 5° cells, complicates analyses 
using Gaussian process models [Director and Bornn (2015)]. However, assuming 
a smooth temperature held, we know that the recorded spatial average must be 
realized exactly at some location in each grid box (closer to the center if a lot of 
points have been averaged together). This frames the spatial averaging problem 
as a location measurement error problem: instead of observing the temperature 
x(s) at each grid center s, we observe the temperature at an unknown location 
displaced from the grid center: y(s ) = x(s + it). 

Following Tingley and Huybers (2010), we assume an exponential covariance 
function for x(s), where distance is calculated along the Earth’s surface. As s is 
given in terms of longitude/latitude (s = (V’, </>)), this has the form 

c(si, s 2 ) = t 2 exp(-/3A) + (t 2 1 Si=S2 

. 2 {<fo 


where r = 6371 is the radius of the earth (in km). At higher latitudes (</>), the 
centers of each grid cell are closer together, so nearby observations are more 
strongly correlated. The nugget term a 2 represents some combination of mea¬ 
surement error in temperature readings and high-frequency spatial variation that 


- (/>i 


+ cos(^i) cos((/> 2 ) sin 2 


V> 2 - Vh 


(14) 


A = 2 r 


arcsm - 
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is inestimable using the gridded observation samples. 

We assume the following model for location errors iq, which are additive dis¬ 
placements of longitude/latitude coordinates s, = (Vh ; < /h) : 

(15) ”))' 

This prior is equivalent to assuming that distance along the Earth’s surface (great- 
circle distance) between each grid center and the corresponding observation loca¬ 
tion has a scaled chi distribution, d(si, Si + Ui) ~ <J U X- Combining (15) and (14), 
we use Monte Carlo to compute k. 

We treat parameters t 2 ,/3,ct 2 as unknown, but fix cr 2 = 7500. At this value, 
the median magnitude of the location errors in great-circle distance is 102km, 
which is consistent with analyzing the coordinates of the temperature recording 
sites used to compile the CRUTEM3v data 2 . 

6.1 Kriging 

We first apply Kriging approaches to interpolate the CRUTEM3v data, both 
adjusting for and ignoring location errors (15). Because parameters t 2 ,/3,ct 2 are 
unknown, we first need to estimate them using maximum likelihood (when ig¬ 
noring location errors) or maximum pseudo-likelihood (8) (when adjusting for 
location errors). These can then be plugged in to covariance functions c and k to 
obtain “empirical” Kriging equations we can use for interpolation [Zimmerman 
and Cressie (1992)]. 

We find small differences in parameter estimates when ignoring location errors 
(assuming a 2 = 0) and adjusting for them (assuming cj 2 = 7500): 


2 

f 2 

P 

-2 

G X 

0 

1.1671 

1.4275 x 10~ 4 

0.0747 

7500 

1.1649 

1.4677 x 10 -4 

0.0699 


Table 2 

Covariance function parameter estimates when ignoring location errors (assuming cr 2 = 0) and 
adjusting for location errors (assuming a 2 = 7500/. 

Consequently, when we interpolate data at the centers of grid cells for which no 
data was observed, we see differences between the KALE and KILE approaches. 
Figure 15 shows the differences between KALE and KILE interpolations (both 
point and interval estimates). Relative to the range of the data (most anomolies 
are in the interval (—1,1)), the discrepency between KALE and KILE does not 
seem very significant. 

6.2 HMC 

Using HMC, parameter inference and interpolations are made simultaneously. 
The resulting point and interval predictions differ substantially from the Kriging 
results. However, because HMC incorporates parameter uncertainty in predic¬ 
tions, this comparison is not sufficient to illustrate the impact of location errors 
on conclusions from this data. A more appropriate comparison is between HMC 

2 Station locations are vieweable at https://www.ncdc.noaa.gov/oa/climate/glicn-daily/ 
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Fig 15. Kriging results for interpolating temperature anomolies from summer 2011. The top plot 
shows interpolations at unobserved grid centers given by KALE. The bottom left plot shows the 
difference in estimates between KALE and KILE ('KALE - KILE,), and the bottom right plot 
shows difference in 95% interval widths between KALE and KILE. 


with a location error model (cr^ = 7500), and HMC assuming with no location 
errors (cr^ = 0). These results are plotted in Figure 16. 
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Fig 16. Results for interpolating temperature anomolies from summer 2011 using HMC. The 
top plot shows interpolations at unobserved grid centers, assuming location errors = 7500. 
The bottom left plot shows the difference in estimates between the location error model and the 
model with a j) = 0. The bottom right plot shows difference in 95% interval widths. 


Using HMC, accounting for location errors produces more significant differ¬ 
ences in inference/prediction than was observed for Kriging. This is particularly 
true for interval predictions, where adjusting for location errors yields intervals 
as much as 0.1 wider, which is a significant discrepency when most observations 
lie in (—1,1). 
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Figure 17 shows posterior densities for unknown parameters of the covari¬ 
ance function based on HMC draws from the a 2 = 7500 and = 0 models 
(the Kriging estimates of these parameters are vertical lines). HMC under loca¬ 
tion error model (a 2 = 7500) gives slightly larger (3 estimates than when using 
a 2 = 0, meaning observations are inferred to be less strongly correlated. This 
yields prediction intervals that tend to be wider (see Figure 16). The most ex¬ 
treme descrepencies occur in the arctic, where distances between grid points are 
closest. The fact that modeling location errors adds additional uncertainty to arc¬ 
tic predictions is of particular interest to climate scientists, as accurate climate 
reconstruction for the arctic region is essential for understanding recent climate 
change patterns [Cowtan and Way (2014)]. 



Fig 17. Density of posterior draws from HMC using a J = 7500 (blue) and = 0 (red). Point 
estimates of these parameters from Kriging (Table 2) are shown as vertical lines. 


The difference between predictions obtained under the a 2 = 0 and a 2 = 7500 
models using HMC suggests that modeling location errors, even when they are 
small in magnitude, meaningfully impacts parameter estimates and predictions 
at unobserved locations. The fact that results for HMC (assuming a 2 = 7500) 
also differ from the results using KALE, while the KILE results do so less, demon¬ 
strates that moment procedures such as Kriging may be ineffective in adjusting 
for these errors. 


7. CONCLUSION 

In this paper, we have explored the issue of Gaussian process regression when 
locations in the input space § are subject to error. Even when location errors 
are quite small in magnitude, it is essential to adjust Kriging equations in order 
to obtain good point and interval estimates; further improvements can be made 
by using MCMC to sample directly from the distribution of the measurement of 
interest given the sampled data. 

Both MCMC and Kriging will be infeasible for large data sets, due to the cost 
of the covariance matrix inversion. A useful future study would be to adapt the 
procedures discussed in this paper to methods for inference and prediction for 
large spatial data sets, such as the predictive process approach [Banerjee et al. 
(2008)], low rank representations [Cressie and Johannesson (2008)], likelihood 
approximations [Stein, Chi and Welty (2004)], and Markov random held approx¬ 
imations [Lindgren, Rue and Lindstrom (2011)]. It will also be useful to extend 
the analysis of this paper to regimes where location errors may be correlated 
with the process of interest x. For example, in climate data, regions with extreme 
climates will be harder to sample, thus there may be greater error in the spatial 
refencing of such sampling than for regions that are easier to sample. 
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APPENDIX A: PROOFS OF RESULTS 
A.l Proof of Proposition 3.1 

k is a valid covariance function if and only if for all n, s n , and {a* 6 M, i = 
1,..., n}, we have 

n n 

EE aiajk{si , Sj ) > 0 . 

i=i j =i 

From (3), this condition can be rewritten: 


n n n n » 

Yj a iajk(si, Sj) = ^2 a i a j / c(si+Ui,Sj + Uj)dg Sn (u n ) 

: , , I j=l 

n. n n 

= / Y Y cuajC^Sj + Uj, Sj + Uj)dg Sn ( u n ) 

J § :_1 :_-i 


i=l j=l 


i=l i=l 


As c is a valid covariance function, the integrand in this expression is always 
non-negative, so the integral is also non-negative. Thus k is a valid covariance 
function. 

Note that for the common scenario where location errors are independent, so 
that g Sn is a product measure g Sl x ... x g Sn , then Proposition 3.1 is a special 
case of kernel convolution [Rasmussen (2006)]. 

A.2 Proof of Proposition 3.2 

Without loss of generality, we can assume r 2 = 1 and fix /?, A > 0. Using the 
fact that k*(s, s*) = E[exp(—/?||s + u — s*|[ 2 )], evaluating the moment generating 
function of a non-central Xp random variable ||s + u — s*|| 2 yields 

c(al) = E[(W»*) - X (s-)?\ = 1 - ( T -T- I ) P exp (5=^5) ■ 
Differentiating, we get 



2l3[2l3{p a l-^)+p] ( -2/1A 2 \ 

(1 + 2 «P+2 eXP Vl + 2/J<r;h 


If /?A 2 < p/2, then (/(cr 2 ) > 0 for all a 2 > 0. Since c(cr 2 ) is left continuous at 0, 
continuous on M + , and c(0) = cq, this means /3A 2 < p/2 implies c(c 2 ) > co for 
all a 2 . 

Otherwise, if /3A 2 > p/2, then for all 0 < u 2 < ^ we have c'(<7 2 ) < 0. 

Once again, because c(<7 2 ) is left continuous at 0, continuous on M + , and c(0) = Co, 
this means c(<r 2 ) < Co for cr 2 in this interval. 

A.3 Proof of Proposition 3.3 


Let W = x(s*) — x KALE (s*). We can explicitly write the dependence of W on 

W\u n ~ Af(0, V(Un)) 


U ( u n ) = o - 2 + yc(s n + u n , s n + u n )7 - 27'C(s n + u n , s*), 
7 = K ( s n , s n ) _ 1 K *( s n , s*), 


where 
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and cr 2 = V[x(s*)]. Thus 


P (W <z) = E[P(VF < z|u n )] 


A.4 Proof of Theorem 3.4 


= E 


<L 



First, our assumptions in the hypothesis imply that k is continuous everywhere 
in § 2 except where si = sy. To see this, take any distinct si, S 2 E § and sequence 
sf,sjp converging to (si,S 2 ). The sequence cfs™ + u ™, s™ + u™) is bounded 
and converges in distribution to c(si + u\,S 2 + u-})- Thus, by the Dominated 
Convergence theorem, A:(s™, s™) —> k(s\, S 2 ). 

Now, for any n E N, the KILE MSE in predicting x(s*) given y n is 

E[(x(s*) - x K ile(x*)) 2 ] = V[x(s*)} - 2C(s*, s n )C(s n , s n )- 1 K*(s n , s*) 

(16) + C(s*, s„)C(s n , s n ) _1 K(s n , s n )C(s n , s n ) _1 C(s n , s*). 


The matrix C(s„, s n ) is symmetric and positive definite, and thus it can be written 
as C(s n ,s n ) = QAQ', where Q is an orthogonal matrix and A is diagonal. 
Assume without loss of generality the entries are A are ordered 0 < Ai < ... < 
X n . Similarly, write K(s n ,s„) = RfiR/. Further letting a = C(s*,s n )Q, b = 
Q'K(s n , s*), and D = Q'RIT , we can write (16) as 

E[(x(s*) - x KILE (x*)) 2 ] = Y[x(s*)} - 2a'A _1 b + a'A^DD'A^a 


(17) 


= V[x(s*)]-2^ 
2—1 


Q*ibi 


A i 



2 


Let £ = Aithen Equation (17) can be expressed as 


(18) 


E[(x(s*) 


®kile(£*)) 2 ] = £ 2 



+ M£)j 


where h is linear in £. 

Without loss of generality, assume si — > S 2 - Thus C(s n , s„) becomes rank n— 1, 
with Ai — > 0 and for all i > 1, Aj —> X* > 0. Thus £ —> 00 . However, K(s n ,s n ) 
does not become singular, since P(ui / u?) < 1 implies 


lim k(si,s 2 ) 7^ V[y(s 2 )]. 

SI ^S2 

Since k is continuous and K(s n ,s n ) nonsingular in the limit, all of the terms 
besides £ in (17) converge as si —> sy, that is a —> a*, b —> b*, and D —> D* as 
si — > S 2 • Moreover, we cannot have D\- = 0 for all j, as this contradicts K(s n , s n ) 
remaining full-rank. Lastly, since C(s*,s n ) 7 ^ 0 and Q is orthogonal, cr 7 ^ 0 and 
a* / 0 for all i = 1 ,..., n. 

Thus the quadratic coefficient in (18), Y2'j=i a iD 2 j is strictly positive, and 
h(£) = 0(£). Because £ —> 00 , we get 

lim E[(x(s*) - Xkile(^*)) 2 ] = 00 . 

SI >S2 

For pathological choices of g Sn where k is not continuous everywhere and limits 
for b and D may not exist, all components of these terms can be still be bounded, 
which is sufficient for Theorem 3.4 to hold. 
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A.5 Proof of Proposition 4.1 

Bayes rule predictors by definition satisfy R^vr) < R 7r (7r), which confirms 
the two inequalities in the statement of Proposition 4.1. The equality R vr (7To) = 
Rvrolvro) holds since the risk of the Bayes estimator under 7To is a quadratic form, 
and therefore constant for all n £ n 0 , c : 

Rtto^o) = E 7 r [(E 7 ro [a:(s*)|x n ] - x(s*)) 2 ] 

= E 7 r [(C(s*, s n )C(s n , s n ) _ 1 x n - icfs*)) 2 ] 

= c(s*, S *) - C(s*, s„)C(s n , s n ) - 1 C(s n , s*) 

= R 7r (7T 0 ). 
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